Dataset - Seoul (Districts), South Korea Air Pollution

API Location: https://aqicn.org/json-api/doc/

Fields:

  • Measurement_Date – Date and time of air pollution measurement
  • Country – Country of origin of the air pollution reading (South Korea)
  • City – City of origin of the air pollution reading (Seoul)
  • District – District of origin of the air pollution reading (Seoul districts)
  • Latitude – Latitude of air pollution measurement station
  • Longitude – Longitude of air pollution measurement station
  • PM10 – Particulate matter (PM) with a diameter of 10 micrometers or less, measured in micrograms per cubic meter (µg/m³)
  • PM2.5 – Particulate matter (PM) with a diameter of 2.5 micrometers or less, measured in micrograms per cubic meter (µg/m³)

Imports

In [1]:
import json
import numpy as np
import pandas as pd
import requests as req

from scipy import stats
from datetime import datetime

Definitions

This function reads the CSV data from part 1, retrieves the latitude and longitude of each station that collects air pollution data, and returns the coordinates as a dict keyed by Seoul district.

In [2]:
def getStationCoordinates(airData):
    
    # Keep one row per district so each latitude and longitude stays paired with
    # its own district (calling unique() on each column separately can misalign
    # them if two stations share a latitude or longitude value)
    stations = airData.drop_duplicates(subset='District')
    
    # Create dict containing each district's air station coordinates
    coorDict = {}
    for district, lat, long in zip(stations['District'], stations['Latitude'],
                                   stations['Longitude']):
        coorDict.update({district: {'lat': lat, 'long': long}})

    return coorDict
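
The dict is only correct if each coordinate stays paired with its district. The following is a minimal sketch on made-up rows (the district names and coordinates are hypothetical, and plain Python lists stand in for the dataframe) of why de-duplicating whole rows, rather than individual columns, keeps that pairing:

```python
# Hypothetical rows standing in for the CSV: two districts share a latitude
rows = [
    {'District': 'A-gu', 'Latitude': 37.5, 'Longitude': 127.0},
    {'District': 'A-gu', 'Latitude': 37.5, 'Longitude': 127.0},
    {'District': 'B-gu', 'Latitude': 37.5, 'Longitude': 126.9},
    {'District': 'C-gu', 'Latitude': 37.6, 'Longitude': 126.8},
]


def unique_in_order(values):
    """Return the distinct values in first-seen order."""
    return list(dict.fromkeys(values))


# De-duplicating each column on its own loses the row pairing:
# there are 3 districts but only 2 distinct latitudes
districts = unique_in_order(r['District'] for r in rows)
lats = unique_in_order(r['Latitude'] for r in rows)
print(len(districts), len(lats))

# Keeping the first full row per district preserves the pairing
paired = {}
for r in rows:
    paired.setdefault(r['District'], {'lat': r['Latitude'], 'long': r['Longitude']})
print(paired['B-gu'])
```

With independent column lists, index 1 would pair B-gu with the latitude that belongs to C-gu; the row-based dict avoids this.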

Using the WAQI API, this function retrieves air pollution information for each district's station coordinates (collected from the CSV data with the getStationCoordinates function above), specifically the daily forecast of PM10 and PM2.5 pollutants.

In [3]:
def getStationData(coorDict, API_KEY_json_file):
    
    # API URL and API Key
    url = 'https://api.waqi.info/feed/'
    
    with open(API_KEY_json_file) as key:
        api_key = json.load(key)['API_KEY']
    
    # Create a dict containing air pollution information about each district using their coordinates
    stationData = {}
    for district in coorDict.keys():
        
        # Get the latitude and longitude of the district station
        lat = coorDict[district]['lat']
        long = coorDict[district]['long']
        
        # Create the URL to get the information from the API
        content = req.get(url + 'geo:' + str(lat) + ';' + str(long) + '/?token=' + api_key).json()
        
        # Create the dict to store all the data gathered from the API
        stationData.update({district: {'Coor': {'lat': content['data']['city']['geo'][0],
                                                'long': content['data']['city']['geo'][1]},
                                       'Data': {}}})
        
        # Parse through the forecast data gathered from the API
        # into the station data dict for later manipulation
        for val in ['pm10', 'pm25']:
            dates = []
            avgs = []

            for entry in content['data']['forecast']['daily'][val]:
                day = datetime.strptime(entry['day'], '%Y-%m-%d')
                # Drop the month's leading zero without truncating months 10-12
                # (slicing off the first character would turn '10/...' into '0/...')
                dates += ['{}/{:02d}/{}'.format(day.month, day.day, day.year)]
                avgs += [entry['avg']]

            stationData[district]['Data'][val] = {'Dates': dates, 'Avgs': avgs}

    return stationData
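
The parsing step can be tried without a network call or an API key. The sketch below runs on a hand-written payload that contains only the keys the function above reads; the sample values, the `parse_forecast` helper, and the check on a `status` field are assumptions made for illustration, not part of the notebook.

```python
from datetime import datetime

# Hypothetical response, trimmed to the keys the function above reads
sample = {
    'status': 'ok',
    'data': {
        'city': {'geo': [37.572025, 127.005028]},
        'forecast': {'daily': {
            'pm10': [{'day': '2020-07-15', 'avg': 15}, {'day': '2020-10-05', 'avg': 24}],
            'pm25': [{'day': '2020-07-15', 'avg': 40}, {'day': '2020-10-05', 'avg': 75}],
        }},
    },
}


def parse_forecast(content):
    """Extract station coordinates and per-pollutant daily averages."""
    # Assumption: a failed request reports a status other than 'ok'
    if content.get('status') != 'ok':
        raise ValueError('API request failed: {}'.format(content.get('data')))

    data = content['data']
    parsed = {'Coor': {'lat': data['city']['geo'][0], 'long': data['city']['geo'][1]},
              'Data': {}}
    for val in ['pm10', 'pm25']:
        # .get(...) tolerates a station that has no forecast for this pollutant
        entries = data.get('forecast', {}).get('daily', {}).get(val, [])
        days = [datetime.strptime(e['day'], '%Y-%m-%d') for e in entries]
        parsed['Data'][val] = {
            # Month without a leading zero, day with one, four-digit year
            'Dates': ['{}/{:02d}/{}'.format(d.month, d.day, d.year) for d in days],
            'Avgs': [e['avg'] for e in entries],
        }
    return parsed


result = parse_forecast(sample)
print(result['Data']['pm10'])
# {'Dates': ['7/15/2020', '10/05/2020'], 'Avgs': [15, 24]}
```

Checking the status before indexing into the payload turns a bad key or an unknown station into a readable error instead of a `KeyError` deep inside the loop.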

This function takes the data gathered from the API and formats it into a dataframe.

In [4]:
def createAirData_Dataframe(stationData):
    tempDfList = []
    
    # Create a dataframe for each of the districts in the station data dict
    for district in stationData.keys():
        newDf = pd.DataFrame({
            'Measurement_Date': stationData[district]['Data']['pm10']['Dates'],
            'Country': ['Republic of Korea'] * len(stationData[district]['Data']['pm10']['Dates']),
            'City': ['Seoul'] * len(stationData[district]['Data']['pm10']['Dates']),
            'District': [district] * len(stationData[district]['Data']['pm10']['Dates']),
            'Latitude': [stationData[district]['Coor']['lat']] * len(stationData[district]['Data']['pm10']['Dates']),
            'Longitude': [stationData[district]['Coor']['long']] * len(stationData[district]['Data']['pm10']['Dates']),
            'PM10': stationData[district]['Data']['pm10']['Avgs'],
            'PM2.5': stationData[district]['Data']['pm25']['Avgs'],
        })

        tempDfList.append(newDf)
    
    # Empty dataframe is created with the right column names
    tempDf = pd.DataFrame(columns=['Measurement_Date', 'Country', 'City', 'District',
                                   'Latitude', 'Longitude', 'PM10', 'PM2.5'])
    
    # Concatenates the empty dataframe and every district dataframe into one dataframe
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    airData = pd.concat([tempDf] + tempDfList)

    return airData
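
The function above pairs PM10 and PM2.5 values by list position and takes the dates from the PM10 list, so it relies on both forecasts covering the same days. The following is a minimal sketch of pairing the two pollutants by date instead; the `forecast_rows` helper and the sample station are hypothetical and use plain dicts in the same shape as stationData.

```python
def forecast_rows(district, station):
    """Build one row per date, pairing PM10 and PM2.5 by date, not position."""
    pm10 = dict(zip(station['Data']['pm10']['Dates'], station['Data']['pm10']['Avgs']))
    pm25 = dict(zip(station['Data']['pm25']['Dates'], station['Data']['pm25']['Avgs']))

    rows = []
    for date in station['Data']['pm10']['Dates']:
        # Skip dates that lack a PM2.5 forecast instead of shifting values
        if date not in pm25:
            continue
        rows.append({
            'Measurement_Date': date,
            'Country': 'Republic of Korea',
            'City': 'Seoul',
            'District': district,
            'Latitude': station['Coor']['lat'],
            'Longitude': station['Coor']['long'],
            'PM10': pm10[date],
            'PM2.5': pm25[date],
        })
    return rows


# Hypothetical station whose PM2.5 forecast starts one day later than PM10
station = {
    'Coor': {'lat': 37.572025, 'long': 127.005028},
    'Data': {
        'pm10': {'Dates': ['7/15/2020', '7/16/2020', '7/17/2020'], 'Avgs': [15, 24, 24]},
        'pm25': {'Dates': ['7/16/2020', '7/17/2020'], 'Avgs': [75, 77]},
    },
}

rows = forecast_rows('Jongno-gu', station)
print(len(rows), rows[0]['Measurement_Date'], rows[0]['PM10'], rows[0]['PM2.5'])
```

A list of row dicts like this can be passed directly to `pd.DataFrame(rows)` to obtain the same columns as the function above.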

This function filters the reformatted data to remove outliers that might skew the results. It keeps a row only if its value in each data column (PM10 and PM2.5) lies within 3 standard deviations of that column's mean, and it also drops rows with negative readings.

In [5]:
def filterData(data):
    # The column(s) that hold numeric data
    dataCols = ['PM10', 'PM2.5']
    
    # Making sure that the data in data columns is numeric and not string
    for col in dataCols:
        data[col] = pd.to_numeric(data[col])

    # This applies a filter to all the data columns of the dataframe:
    # * For each column, it first computes the Z-score of each value 
    #   in the column relative to the column mean and standard deviation.
    # * If the score is not within -3 and +3 standard deviations away from the mean for that 
    #   column, then the record is filtered out of the dataframe (thus removing the outliers)
    filteredData = data[(np.abs(stats.zscore(data[dataCols])) < 3).all(axis=1)]
    
    # This filter removes any data that is less than zero because
    # the measurement of pollutants in the air cannot go below zero
    filteredData = filteredData[(filteredData[dataCols] >= 0).all(axis=1)]

    print('Total number of rows BEFORE data is removed: {:,}\n Total number of rows AFTER data is removed: {:,}\n'
          '====================================================\n\t       Total number of rows removed: {:>3,}'
          .format(len(data.index), len(filteredData.index), len(data.index) - len(filteredData.index)))
    
    return filteredData
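
To see the filter act on an outlier, the sketch below applies the same two rules (|Z-score| below 3 in every data column, and no negative values) to made-up readings using only the standard library. The `zscore_filter` helper and the sample values are hypothetical; `statistics.pstdev` is used because it is the population standard deviation, which matches the default of `stats.zscore`.

```python
from statistics import mean, pstdev


def zscore_filter(rows, cols, threshold=3.0):
    """Keep rows whose values in every column are non-negative and lie
    within `threshold` population standard deviations of the column mean."""
    col_stats = {c: (mean(r[c] for r in rows), pstdev(r[c] for r in rows))
                 for c in cols}

    def keep(row):
        for c in cols:
            mu, sd = col_stats[c]
            # Assumption: a constant column (sd == 0) is treated as having no outliers
            z = (row[c] - mu) / sd if sd else 0.0
            if abs(z) >= threshold or row[c] < 0:
                return False
        return True

    return [r for r in rows if keep(r)]


# Hypothetical readings: 19 ordinary rows plus one extreme PM2.5 value
rows = [{'PM10': 20 + (i % 5), 'PM2.5': 500 if i == 19 else 20} for i in range(20)]

kept = zscore_filter(rows, ['PM10', 'PM2.5'])
print(len(rows), len(kept))
# 20 19
```

The row with a PM2.5 value of 500 has a Z-score of about 4.36 and is removed; every other row stays.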

This function calculates the AQI (Air Quality Index) value from the PM2.5 reading and determines its risk level. The AQI formula, value ranges, and risk levels are all taken from the United States EPA (Environmental Protection Agency).

In [6]:
def calculate_AQI(airData):
    airData = airData.reset_index(drop=True)
    
    aqiValues = []
    riskLevel = []

    for i in range(len(airData)):
        i_low = 0
        i_high = 0

        c_low = 0
        c_high = 0

        # PM2.5 AQI Value Calculation
        if 0 <= airData['PM2.5'][i] <= 12:
            c_low = 0
            c_high = 12

            i_low = 0
            i_high = 50
            
        elif 12.1 <= airData['PM2.5'][i] <= 35.4:
            c_low = 12.1
            c_high = 35.4

            i_low = 51
            i_high = 100
            
        elif 35.5 <= airData['PM2.5'][i] <= 55.4:
            c_low = 35.5
            c_high = 55.4

            i_low = 101
            i_high = 150
            
        elif 55.5 <= airData['PM2.5'][i] <= 150.4:
            c_low = 55.5
            c_high = 150.4

            i_low = 151
            i_high = 200
            
        elif 150.5 <= airData['PM2.5'][i] <= 250.4:
            c_low = 150.5
            c_high = 250.4

            i_low = 201
            i_high = 300
            
        elif 250.5 <= airData['PM2.5'][i] <= 350.4:
            c_low = 250.5
            c_high = 350.4

            i_low = 301
            i_high = 400
            
        elif 350.5 <= airData['PM2.5'][i] <= 500.4:
            c_low = 350.5
            c_high = 500.4

            i_low = 401
            i_high = 500
        
        # AQI Formula
        aqiValues += [int(round(((i_high - i_low) / (c_high - c_low)) * 
                                (airData['PM2.5'][i] - c_low) + i_low, 0))]

        # Determine AQI Risk Level
        if 0 <= aqiValues[i] <= 50:
            riskLevel += ['Good']
        elif 51 <= aqiValues[i] <= 100:
            riskLevel += ['Moderate']
        elif 101 <= aqiValues[i] <= 150:
            riskLevel += ['Unhealthy for Sensitive Groups']
        elif 151 <= aqiValues[i] <= 200:
            riskLevel += ['Unhealthy']
        elif 201 <= aqiValues[i] <= 300:
            riskLevel += ['Very Unhealthy']
        elif 301 <= aqiValues[i] <= 500:
            riskLevel += ['Hazardous']
    
    # Add the AQI values to the data frame
    airData['AQI_(PM2.5)'] = aqiValues
    airData['AQI_Risk_Level'] = riskLevel

    return airData
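
The same calculation can be expressed as a lookup over a breakpoint table, which makes the formula AQI = (I_high - I_low) / (C_high - C_low) * (C - C_low) + I_low easier to check by hand. The sketch below is one possible restatement, not the notebook's implementation: the `pm25_to_aqi` helper is hypothetical, the breakpoints are copied from calculate_AQI, and truncating the concentration to one decimal place (so that a value such as 12.05 does not fall between two ranges) is an added assumption.

```python
from decimal import Decimal, ROUND_DOWN

# PM2.5 breakpoints copied from calculate_AQI: (c_low, c_high, i_low, i_high)
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

# Upper AQI bound of each risk level, in ascending order
RISK_LEVELS = [(50, 'Good'), (100, 'Moderate'), (150, 'Unhealthy for Sensitive Groups'),
               (200, 'Unhealthy'), (300, 'Very Unhealthy'), (500, 'Hazardous')]


def pm25_to_aqi(conc):
    """Return (AQI value, risk level) for a PM2.5 concentration."""
    if conc < 0:
        raise ValueError('concentration cannot be negative')
    # Assumption: truncate to one decimal so 12.05 lands in a range, not a gap
    c = float(Decimal(str(conc)).quantize(Decimal('0.1'), rounding=ROUND_DOWN))
    for c_low, c_high, i_low, i_high in PM25_BREAKPOINTS:
        if c_low <= c <= c_high:
            aqi = int(round((i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low))
            level = next(name for upper, name in RISK_LEVELS if aqi <= upper)
            return aqi, level
    raise ValueError('concentration {} is above the last breakpoint'.format(conc))


print(pm25_to_aqi(40))  # (112, 'Unhealthy for Sensitive Groups')
print(pm25_to_aqi(75))  # (161, 'Unhealthy')
```

For a reading of 40, the range is 35.5 to 55.4, so the value is (150 - 101) / (55.4 - 35.5) * (40 - 35.5) + 101 ≈ 112.08, which rounds to 112. Raising an error for readings above the last breakpoint avoids silently returning a result computed from zeroed bounds.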

API Data Manipulation

Get Air Station Coordinate (Latitude and Longitude) list from CSV Data

In [7]:
airData = pd.read_csv('CSV-Air_Pollution_Data-(Reformed_and_AQI_Values).csv')

coorDict = getStationCoordinates(airData)

print('Total Number of District Stations: ' + str(len(coorDict.keys())) + '\n')

for district in coorDict.keys():
    print('District: {:>15}, Coordinates: '.format(district) + 
          str(round(coorDict[district]['lat'], 8)) + ' (Latitude), ' + 
          str(round(coorDict[district]['long'], 8)) + ' (Longitude)')
Total Number of District Stations: 25

District:       Jongno-gu, Coordinates: 37.5720164 (Latitude), 127.0050075 (Longitude)
District:         Jung-gu, Coordinates: 37.5642629 (Latitude), 126.9746757 (Longitude)
District:      Yongsan-gu, Coordinates: 37.5400327 (Latitude), 127.00485 (Longitude)
District:    Eunpyeong-gu, Coordinates: 37.6098232 (Latitude), 126.9348476 (Longitude)
District:    Seodaemun-gu, Coordinates: 37.5937421 (Latitude), 126.9496787 (Longitude)
District:         Mapo-gu, Coordinates: 37.5555803 (Latitude), 126.9055975 (Longitude)
District:    Seongdong-gu, Coordinates: 37.5418642 (Latitude), 127.0496589 (Longitude)
District:     Gwangjin-gu, Coordinates: 37.5471803 (Latitude), 127.0924929 (Longitude)
District:   Dongdaemun-gu, Coordinates: 37.5757428 (Latitude), 127.0288848 (Longitude)
District:     Jungnang-gu, Coordinates: 37.5848485 (Latitude), 127.0940229 (Longitude)
District:     Seongbuk-gu, Coordinates: 37.6067189 (Latitude), 127.0272794 (Longitude)
District:      Gangbuk-gu, Coordinates: 37.6479299 (Latitude), 127.0119518 (Longitude)
District:       Dobong-gu, Coordinates: 37.6541919 (Latitude), 127.0290879 (Longitude)
District:        Nowon-gu, Coordinates: 37.6587743 (Latitude), 127.0685054 (Longitude)
District:    Yangcheon-gu, Coordinates: 37.5259388 (Latitude), 126.8566029 (Longitude)
District:      Gangseo-gu, Coordinates: 37.54464 (Latitude), 126.8351506 (Longitude)
District:         Guro-gu, Coordinates: 37.4984981 (Latitude), 126.8896924 (Longitude)
District:    Geumcheon-gu, Coordinates: 37.4523569 (Latitude), 126.9082956 (Longitude)
District: Yeongdeungpo-gu, Coordinates: 37.5250065 (Latitude), 126.8973705 (Longitude)
District:      Dongjak-gu, Coordinates: 37.4809167 (Latitude), 126.9714807 (Longitude)
District:       Gwanak-gu, Coordinates: 37.4873546 (Latitude), 126.927102 (Longitude)
District:       Seocho-gu, Coordinates: 37.5045471 (Latitude), 126.9944578 (Longitude)
District:      Gangnam-gu, Coordinates: 37.5175282 (Latitude), 127.0474699 (Longitude)
District:       Songpa-gu, Coordinates: 37.5026857 (Latitude), 127.0925092 (Longitude)
District:     Gangdong-gu, Coordinates: 37.5449625 (Latitude), 127.1367917 (Longitude)

Get Air Station Data Using the API

In [8]:
# Replace WAQI_API_KEY.json with the path to the file that stores your API key, and
# make sure the file contains the key as {"API_KEY": "<your key here>"}
stationData = getStationData(coorDict, 'WAQI_API_KEY.json')

# Just the first district in the dict to show the data's structure
stationData[list(stationData.keys())[0]]
Out[8]:
{'Coor': {'lat': 37.572025, 'long': 127.005028},
 'Data': {'pm10': {'Dates': ['7/15/2020',
    '7/16/2020',
    '7/17/2020',
    '7/18/2020',
    '7/19/2020',
    '7/20/2020',
    '7/21/2020',
    '7/22/2020',
    '7/23/2020',
    '7/24/2020'],
   'Avgs': [15, 24, 24, 26, 27, 49, 64, 57, 32, 30]},
  'pm25': {'Dates': ['7/15/2020',
    '7/16/2020',
    '7/17/2020',
    '7/18/2020',
    '7/19/2020',
    '7/20/2020',
    '7/21/2020',
    '7/22/2020',
    '7/23/2020',
    '7/24/2020'],
   'Avgs': [40, 75, 77, 84, 81, 125, 159, 151, 101, 93]}}}

Create Dataframe to Hold Station Data

In [9]:
airData = createAirData_Dataframe(stationData)

airData.head()
Out[9]:
   Measurement_Date            Country   City   District   Latitude   Longitude  PM10  PM2.5
0         7/15/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    15     40
1         7/16/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    24     75
2         7/17/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    24     77
3         7/18/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    26     84
4         7/19/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    27     81

Remove Outliers and Invalid Values from API Data

In [10]:
filteredData = filterData(airData)
Total number of rows BEFORE data is removed: 232
 Total number of rows AFTER data is removed: 232
====================================================
	       Total number of rows removed:   0

Calculate AQI Data

In [11]:
airData = calculate_AQI(filteredData)

# Two new columns added: AQI_(PM2.5) and AQI_Risk_Level
print('New columns: ' + ', '.join(list(airData.keys())))
New columns: Measurement_Date, Country, City, District, Latitude, Longitude, PM10, PM2.5, AQI_(PM2.5), AQI_Risk_Level
In [12]:
airData.head()
Out[12]:
   Measurement_Date            Country   City   District   Latitude   Longitude  PM10  PM2.5  AQI_(PM2.5)  AQI_Risk_Level
0         7/15/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    15     40          112  Unhealthy for Sensitive Groups
1         7/16/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    24     75          161  Unhealthy
2         7/17/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    24     77          162  Unhealthy
3         7/18/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    26     84          166  Unhealthy
4         7/19/2020  Republic of Korea  Seoul  Jongno-gu  37.572025  127.005028    27     81          164  Unhealthy

Output Reformed API Data to CSV

In [13]:
print('Final size of data: {:,} columns and {:,} rows'.format(airData.shape[1], airData.shape[0]))
Final size of data: 10 columns and 232 rows
In [14]:
airData.to_csv('Reformed_Data/API-Air_Pollution_Data-(Reformed_and_AQI_Values).csv', index=False)